Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change the order of local tags #26

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from
Draft

Change the order of local tags #26

wants to merge 4 commits into from

Conversation

SaulLu
Copy link
Collaborator

@SaulLu SaulLu commented Sep 6, 2021

I propose in this PR to change the way in which local tags are added to the text.

Motivation

HTML defines a tree. In this sense, it is important that the rules for opening and closing tags are respected. For example, the following html is not valid "<h1> test </h1></a><a>". With the current way of adding local tags, this rule may not be respected because of self closing tags which can have their opening and closing tags on the same character index.

As a side note, with the current metadata format, one level of information is lost for HTM tags. Indeed, when two tags have the same index for their opening and closing tag, we don't know which one is the parent.

Implementation details

  • I have modified the way local tags are added
  • I added a test
  • I also modified the HTMLProcessor because HTML tags have a tag value but also attributes.

Review

I would be very happy to have your opinion on this PR

PROCESSORS["html"] = HtmlProcessor

text1, mask1 = add_local_metadata_to_text(self.examples[3], cfg)
target_text = '<a>useless text </a><div><a><div><div></div></div></a></div><h1><i>The Walking Dead</i> (season 8)</h1>\n'
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before the change I propose in this PR the result was here:

<a>useless text </div></a></div></a></div><h1><i><div><a><div><div>The Walking Dead</i> (season 8)</h1>

@SaulLu SaulLu changed the title [WIP] Changing the order of local tags Changing the order of local tags Sep 6, 2021
@SaulLu SaulLu marked this pull request as ready for review September 6, 2021 09:31
Copy link
Collaborator

@tianjianjiang tianjianjiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@tianjianjiang
Copy link
Collaborator

As a side note, with the current metadata format, one level of information is lost for HTM tags. Indeed, when two tags have the same index for their opening and closing tag, we don't know which one is the parent.

Just curious, is this (same index) a limitation of the source dataset?

@SaulLu
Copy link
Collaborator Author

SaulLu commented Sep 6, 2021

Just curious, is this (same index) a limitation of the source dataset?

No no, I compile the dataset myself from Natural Questions and put it in the format proposed here.

@tianjianjiang
Copy link
Collaborator

Just curious, is this (same index) a limitation of the source dataset?

No no, I compile the dataset myself from Natural Questions and put it in the format proposed here.

Ah, I see. I misunderstood. Thank you for the clarification!
I suppose if we truly want to resolve the nested-tag-situation then there may be a bit too much work for your HTML parser?

@SaulLu
Copy link
Collaborator Author

SaulLu commented Sep 6, 2021

I suppose if we truly want to resolve the nested-tag-situation then there may be a bit too much work for your HTML parser?

🙂

I suppose if we truly want to resolve the nested-tag-situation then there may be a bit too much work for your HTML parser?

On the parser side I think it's really feasible, what's more blocking in my opinion is where to store the information in the jsonl format we have defined and how to use it in the add_local_metadata_to_text method without it being something specific to HTML.

@SaulLu SaulLu changed the title Changing the order of local tags Change the order of local tags Sep 7, 2021
@tianjianjiang
Copy link
Collaborator

what's more blocking in my opinion is where to store the information in the jsonl format we have defined and how to use it in the add_local_metadata_to_text method without it being something specific to HTML.

I see. Funny you should mention that, because my first comment was gonna be "having a new property or even structure of jsonl for it" but then I didn't send it for exactly the same reason. lol

That being said, I wonder if entities will also need that kind of nested construct.

@timoschick
Copy link
Contributor

Hi @SaulLu, as discussed yesterday, I fully agree that this is an important issue that needs to be fixed. However, I'm not sure I fully understand how your proposed solution works - would you perhaps have 15 minutes or so sometime tomorrow (ideally, between 15:00 and 17:00 CEST) to go through this pull request with me? :)

@SaulLu
Copy link
Collaborator Author

SaulLu commented Sep 10, 2021

Summary of an offline discussion: unfortunately this PR does not solve all the concerns about the order of the local HTML metadata to be added.

For example, @timoschick has rightly shown that <tr><th>elephant</th><th></th></tr> would be transformed into <tr><th>elephant</th></tr><th></th> with this PR (which is still not correct).

Another pathological case is:

<table>
  <tr>
    <th>event_id</th>
    <th>year</th>
    <th>month</th>
  </tr>
  <tr>
    <td>idghtu</td>
    <td>1998</td>
    <th>may</th>
  </tr>
  <tr>
    <td></td>
    <td></td>
    <td></td>
  </tr>
</table>

I'm converting this PR into a draft, for now, the time to think about another workaround.

@SaulLu SaulLu marked this pull request as draft September 10, 2021 14:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants